Text Structure Aiming at Machine Translation


  • Horacio Saggion

[Figure 6: Meaning Representation Construction]

These processes operate on the following components of the meaning representation:

Propositions: produced as a result of syntactic and semantic analysis.

Semantic and Syntactic Signals: guide the coherence assembler in selecting coherence relations and in deciding where a coherent span ends. Syntactic signals include discourse markers that directly signal the structure of the discourse [Hirschberg and Litman, 1993]. These markers are the primary indication of the presence of a coherence relation in the text. Tense, aspect and semantic information attached to lexical items provide a means of deciding the limits of a text span [Grosz and Sidner, 1986].

Partial Structure: stores propositions and segments that are already linked and await further processing. When processing a proposition Pk, two problems must be resolved: (a) deciding to which text segment Pk will be attached; (b) deciding how the attachment to that segment will be made. Propositions must be saved temporarily until a decision is made.

Coherence Rules: define the conditions that propositions must satisfy in order to be linked together by a coherence relation.

Coherent Span: a group of propositions related by coherence relations. It carries informational content associated with one of the Informational Categories presented earlier.

5 Detailed Example

Figure 7 shows the structure produced by the analysis of the example from Figure 2.

[Figure 7: Text Structure. Labels recoverable from the figure: Background, Objectives, Recommendations, Enable, Elaboration, Parallel, over propositions (1) to (8).]

The main processes that led to this structure are:

Breaking each sentence into propositions: using syntactic and semantic analysis.

Determining references for definite anaphora: the noun phrase "este trabalho" in proposition (2) is resolved using specific knowledge about abstracts.
The corresponding definite noun phrase is "this paper". Various entities only make sense in the context of abstracts. These entities include "the authors", "the paper", "the work", "the objective" and the like. This information is included in the knowledge base system and is very useful when looking for an antecedent for a definite noun phrase. The noun phrase "este tipo de trocador" in proposition (5) is resolved using the preceding discourse. The antecedent is "trocadores de calor compactos", introduced in proposition (1). The noun phrase "o método recomendado" is also resolved using the previous discourse.

Determining the limits of each text span: in proposition (1) the verbal form "são" carries semantic information about general facts (one entity is "defined"). "Is-a" sentences are usually analysed in this way [Sidner, 1978], so proposition (1) is classified as Background. In proposition (2) the verb "apresentar" is used. This verb generally carries purpose or objective information [Jordan, 1991]. Additionally, the noun phrase "este trabalho", which was found to mean "this paper", acts as the subject of the sentence. Since a "paper" has an objective, we can deduce that the proposition marks the objective of the paper. The Objective category is therefore selected, and it spans up to proposition (5). In proposition (6) the item "recomendado" marks the beginning of a new text span, which is classified as Recommendation. Figure 7 shows the limits of each text span.

Determining Coherence Relations: syntactic marks guide the selection of coherence relations. For example, propositions (3) to (5) are linked by coordination, syntactically indicated by commas and by the conjunction "e"; this could mark either a Parallel or a Sequence relation. Note, however, that the same argument, "este tipo de trocador", is used in all three propositions, which signals a preference for a Parallel relation.
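The preference for Parallel over Sequence described here can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Proposition` class, the `choose_coordination_relation` function and the fallback to Sequence when no argument is shared are all assumptions made for the example, and the second argument of each proposition is a placeholder rather than text from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """Hypothetical, simplified proposition: an id plus its resolved arguments."""
    ident: int
    arguments: frozenset


def choose_coordination_relation(propositions):
    """Choose a relation for a run of coordinated propositions.

    Assumed rule: if at least one argument occurs in every proposition,
    prefer Parallel; otherwise fall back to Sequence.
    """
    if len(propositions) < 2:
        raise ValueError("coordination needs at least two propositions")
    shared = frozenset.intersection(*(p.arguments for p in propositions))
    return "Parallel" if shared else "Sequence"


# Propositions (3) to (5) after anaphora resolution; "arg_3" etc. are placeholders.
coordinated = [
    Proposition(3, frozenset({"este tipo de trocador", "arg_3"})),
    Proposition(4, frozenset({"este tipo de trocador", "arg_4"})),
    Proposition(5, frozenset({"este tipo de trocador", "arg_5"})),
]
print(choose_coordination_relation(coordinated))  # prints: Parallel
```

A fuller rule set would presumably also consult tense, aspect and discourse markers before settling on a relation; the shared-argument test is only the one signal named in this example.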
The other coherence relations for the abstract shown in Figure 2 are given in Figure 7.

6 Conclusions

Traditional approaches to machine translation have usually neglected the problem of text structure, treating the source input as a disconnected sequence of sentences. As a result, the representations used by these approaches were unable to capture and exploit the coherence phenomena present in the input.

We concentrate on the specification and construction of a meaning representation of abstracts from scientific papers in Portuguese. This representation must capture the informational content, the coherence relations and the propositional content of the input text. We believe that this representation is appropriate for machine translation because it copes not only with the message being conveyed, but also with the structure of the text. Representing the linguistic structure of the text enables a generator program to choose the surface forms that correctly express the message in the target language while preserving the original structure of the text. Several steps are involved in the construction of such a representation: syntactic analysis, semantic interpretation, anaphora resolution, determination of text spans and determination of coherence relations. We are working with a set of these relations, which were defined according to the phenomena observed in the corpus.